Abstract. The performance of a reject option classifier is quantified using the
$0$-$d$-$1$ loss, where $d \in (0, .5)$ is the loss for rejection. In this paper,
we propose the double ramp loss function, which gives a continuous upper bound
for the $0$-$d$-$1$ loss. Our approach is based on minimizing the regularized
risk under the double ramp loss using difference of convex programming. We show
the effectiveness of our approach through experiments on synthetic and benchmark
datasets. Our approach performs better than the state-of-the-art reject option
classification approaches.


1  Introduction 

The primary focus of classification problems has been on algorithms that return
a prediction on every example. However, in many real life situations, it may be
prudent to reject an example rather than run the risk of a costly potential
misclassification. Consider, for instance, a physician who has to return a
diagnosis for a patient based on the observed symptoms and a preliminary
examination. If the symptoms are either ambiguous, or rare enough to be
unexplainable without further investigation, then the physician might choose not
to risk misdiagnosing the patient. The physician might instead ask for further
medical tests to be performed, or refer the case to an appropriate specialist.
The principal response in these cases is to "reject" the example. This paper
focuses on learning a classifier with a reject option. From a geometric
standpoint, we can view the classifier as having a decision surface as well as a
rejection surface. The rejection region impacts the proportion of examples that
are likely to be rejected, as well as the proportion of predicted examples that
are likely to be correctly classified. A well-optimized classifier with a reject
option is one which minimizes the rejection rate as well as the
misclassification rate on the predicted examples.

Let $x \in \mathbb{R}^p$ be the feature vector and $y \in \{-1, +1\}$ be the
class label. Let $\mathcal{D}(x, y)$ be the joint distribution of $x$ and $y$. A
typical reject option classifier is defined using a bandwidth parameter ($\rho$)
and a separating surface ($f(x) = 0$). $\rho$ is the parameter which determines
the rejection region. A reject option classifier $h(f(x), \rho)$ is then formed
as:

$$h(f(x), \rho) = 1 \cdot I_{\{f(x) > \rho\}} + 0 \cdot I_{\{|f(x)| \le \rho\}} - 1 \cdot I_{\{f(x) < -\rho\}} \tag{1}$$

where $I_{\{A\}}$ is an indicator function which takes value 1 if predicate $A$
is true, else 0. The reject option classifier can be viewed as two parallel
surfaces with the rejection area in between. The goal is to determine $f(x)$ as
well as $\rho$ simultaneously. The performance of this classifier is evaluated
using $L_{0-d-1}$ [8,12], which is

$$L_{0-d-1}(f(x), y, \rho) = 1 \cdot I_{\{y f(x) < -\rho\}} + d \cdot I_{\{|f(x)| \le \rho\}} + 0 \cdot I_{\{y f(x) > \rho\}} \tag{2}$$

In the above loss, $d$ is the cost of rejection. If $d = 0$, then we will always
reject. When $d > .5$, then we will never reject (because the expected loss of
random labeling is 0.5). Thus, we always take $d \in (0, .5)$.
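As a minimal sketch of Eq. (1) and Eq. (2), the following Python snippet evaluates the reject option classifier and the $0$-$d$-$1$ loss on a few scores. The function names and the example scores are illustrative assumptions, not part of the original paper.

```python
def reject_option_predict(fx, rho):
    """Eq. (1): +1 or -1 outside the band, 0 (reject) when |f(x)| <= rho."""
    if fx > rho:
        return 1
    if fx < -rho:
        return -1
    return 0


def loss_0_d_1(fx, y, rho, d):
    """Eq. (2): 1 for a confident error, d for a rejection, 0 otherwise."""
    if abs(fx) <= rho:
        return d
    return 1.0 if y * fx < -rho else 0.0


# Hypothetical scores f(x) and labels y, chosen only for illustration.
scores = [1.3, -0.2, -2.0, 0.4]
labels = [1, 1, 1, -1]
preds = [reject_option_predict(s, rho=0.5) for s in scores]
losses = [loss_0_d_1(s, y, rho=0.5, d=0.2) for s, y in zip(scores, labels)]
print(preds)   # [1, 0, -1, 0]
print(losses)  # [0.0, 0.2, 1.0, 0.2]
```

A prediction of 0 stands for "reject", so the second and fourth examples each incur the rejection cost $d$.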

To learn a reject option classifier, the expectation of
$L_{0-d-1}(\cdot, \cdot, \cdot)$ with respect to $\mathcal{D}(x, y)$ (risk) is
minimized. Since $\mathcal{D}(x, y)$ is fixed but unknown, the empirical risk
minimization principle is used. The risk under $L_{0-d-1}$ is minimized by the
generalized Bayes discriminant [4,8]. $h(f(x), \rho)$ (Eq. (1)) is shown to be
infinite sample consistent with respect to the generalized Bayes classifier [13].


Table 1. Convex surrogates for $L_{0-d-1}$

Loss Function       Definition
Generalized Hinge   $L_{gh}(f(x), y) = \begin{cases} 1 - \frac{1-d}{d}\, y f(x), & \text{if } y f(x) < 0 \\ 1 - y f(x), & \text{if } 0 \le y f(x) < 1 \\ 0, & \text{otherwise} \end{cases}$
Double Hinge        $L_{dh}(f(x), y) = \max[-y(1-d) f(x) + H(d),\; -y d f(x) + H(d),\; 0]$,
                    where $H(d) = -d \log(d) - (1-d) \log(1-d)$

Since minimizing the risk under $L_{0-d-1}$ is computationally cumbersome,
convex surrogates for $L_{0-d-1}$ have been proposed. Generalized hinge loss
$L_{gh}$ (see Table 1) is a convex surrogate for $L_{0-d-1}$ [3,12]. It is shown
that a minimizer of the risk under $L_{gh}$ is consistent to the generalized
Bayes classifier [3]. Double hinge loss $L_{dh}$ (see Table 1) is another convex
surrogate for $L_{0-d-1}$ [7]. The minimizer of the risk under $L_{dh}$ is shown
to be strongly universally consistent to the generalized Bayes classifier [7].
We observe that these convex loss functions have some limitations. For example,
$L_{gh}$ is a convex upper bound to $L_{0-d-1}$ provided $\rho < 1 - d$, and
$L_{dh}$ forms an upper bound to $L_{0-d-1}$ provided
$\rho \in \left(\frac{1 - H(d)}{1 - d}, \frac{H(d) - d}{d}\right)$ (see Fig. 1).
Also, both $L_{gh}$ and $L_{dh}$ increase linearly in the rejection region
instead of remaining constant. These convex losses can become unbounded for
misclassified examples with the scaling of parameters of $f$. Moreover, limited
experimental results are shown to validate the practical significance of these
losses [3,7,12]. A non-convex formulation for learning a reject option
classifier is proposed in [5]. However, theoretical guarantees for the approach
proposed in [5] are not known. While learning a reject option classifier, one
has to deal with the overlapping class regions and outliers. SVM and other
convex loss based approaches are less robust to label noise and outliers in the
data [10]. It is shown that the ramp loss based approach is more robust to
noise [6].

Fig. 1. $L_{gh}$ and $L_{dh}$ for $d = 0.2$. (a) For $\rho = 0.7$, both the
losses upper bound $L_{0-d-1}$. (b) For $\rho = 2$, both the losses fail to upper
bound $L_{0-d-1}$. $L_{gh}$ and $L_{dh}$ both increase linearly even in the
rejection region rather than being flat.
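The dependence of the two convex surrogates on $\rho$ can be checked numerically. The sketch below evaluates both losses from Table 1 on a grid of margins $t = y f(x)$ for the two settings of Fig. 1. It assumes that $H(d)$ uses the natural logarithm and that a finite grid on $[-5, 5]$ is representative; the helper `upper_bounds` is a hypothetical name introduced here.

```python
import math


def loss_0_d_1(t, rho, d):
    """0-d-1 loss as a function of the margin t = y f(x)."""
    if t < -rho:
        return 1.0
    return d if t <= rho else 0.0


def generalized_hinge(t, d):
    if t < 0:
        return 1.0 - (1.0 - d) / d * t
    return 1.0 - t if t < 1 else 0.0


def double_hinge(t, d):
    # Assumption: H(d) is computed with the natural logarithm.
    h = -d * math.log(d) - (1 - d) * math.log(1 - d)
    return max(-(1 - d) * t + h, -d * t + h, 0.0)


def upper_bounds(surrogate, rho, d, lo=-5.0, hi=5.0, steps=2001):
    """True if surrogate(t) >= L_{0-d-1}(t) at every grid point."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return all(surrogate(t, d) >= loss_0_d_1(t, rho, d) - 1e-12 for t in grid)


d = 0.2
for rho in (0.7, 2.0):
    print(rho, upper_bounds(generalized_hinge, rho, d), upper_bounds(double_hinge, rho, d))
```

Under these assumptions both surrogates pass the check for $\rho = 0.7$ and fail it for $\rho = 2$, in line with the caption of Fig. 1.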

Motivated by this, we propose the double ramp loss ($L_{dr}$), which
incorporates a different loss value for rejection. $L_{dr}$ forms a continuous
nonconvex upper bound for $L_{0-d-1}$ and overcomes many of the issues of convex
surrogates of $L_{0-d-1}$. To learn a reject option classifier, we minimize the
regularized risk under $L_{dr}$, which becomes an instance of difference of
convex (DC) functions. To minimize it, we use the DC programming approach [1].
The proposed method has the following advantages: (1) the proposed loss $L_{dr}$
gives a tighter upper bound to $L_{0-d-1}$, (2) $L_{dr}$ requires no constraint
on $\rho$ unlike $L_{gh}$ and $L_{dh}$, (3) our approach can be easily kernelized
for dealing with nonlinear problems.

The rest of the paper is organized as follows. In Section 2 we define the double
ramp loss ($L_{dr}$). Then we discuss its properties and the proposed
formulation based on $L_{dr}$. In Section 3 we derive the $L_{dr}$ based reject
option classifier learning algorithm. We present experimental results in
Section 4. We conclude the paper with the discussion in Section 5.

2 Proposed Approach

Our approach for learning a classifier with reject option is based on minimizing
the regularized risk under $L_{dr}$ (double ramp loss).

2.1 Double Ramp Loss

Double ramp loss is defined as a sum of two ramp loss functions as follows:

$$L_{dr}(f(x), y, \rho) = \frac{d}{\mu}\Big[\big[\mu - y f(x) + \rho\big]_+ - \big[-\mu^2 - y f(x) + \rho\big]_+\Big] + \frac{(1-d)}{\mu}\Big[\big[\mu - y f(x) - \rho\big]_+ - \big[-\mu^2 - y f(x) - \rho\big]_+\Big] \tag{3}$$

where $[a]_+ = \max(0, a)$. $\mu \in (0, 1]$ defines the slope of the ramps in
the loss (see footnote 1). Parameter $\rho$ defines the width of the rejection
region. Fig. 2 shows $L_{dr}$ for $d = 0.2$, $\rho = 2$ for different $\mu$.

Fig. 2. $L_{dr}$ and $L_{0-d-1}$: $\forall \mu > 0, \rho \ge 0$, $L_{dr}$ is an
upper bound for $L_{0-d-1}$.

Theorem 1. (i) $L_{dr} \ge L_{0-d-1}$, $\forall \mu > 0, \rho \ge 0$.
(ii) $\lim_{\mu \to 0} L_{dr}(f(x), y, \rho) = L_{0-d-1}(f(x), y, \rho)$.
(iii) In the rejection region, $y f(x) \in (-\rho + \mu, \rho - \mu^2)$,
$L_{dr}(f(x), y, \rho) = d(1 + \mu)$, a constant.
(iv) $L_{dr} \le (1 + \mu)$, $\forall \rho \ge 0, d \ge 0$.
(v) When $\rho = 0$, $L_{dr}$ is the same as the $\mu$-ramp loss [11].
(vi) $L_{dr}$ is a non-convex function of $(y f(x), \rho)$.

The proof of Theorem 1 is omitted due to space constraints. We see that $L_{dr}$
does not put any restriction on $\rho$, beyond $\rho \ge 0$, for it to be an
upper bound of $L_{0-d-1}$.
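Several parts of Theorem 1 can be sanity-checked on a grid of margins. The sketch below writes the double ramp loss of Eq. (3) as a weighted sum of two shifted $\mu$-ramp losses and asserts properties (i), (iii), (iv) and (v) for one hypothetical parameter setting. It is a numerical illustration under those assumptions, not a proof.

```python
def ramp(t, mu):
    """mu-ramp loss: (1/mu) * ([mu - t]_+ - [-mu^2 - t]_+)."""
    return (max(0.0, mu - t) - max(0.0, -mu ** 2 - t)) / mu


def double_ramp(t, rho, d, mu):
    """Eq. (3) as a function of the margin t = y f(x)."""
    return d * ramp(t - rho, mu) + (1 - d) * ramp(t + rho, mu)


def loss_0_d_1(t, rho, d):
    if t < -rho:
        return 1.0
    return d if t <= rho else 0.0


# Hypothetical setting used only for this check.
d, rho, mu = 0.2, 2.0, 0.5
grid = [-6 + 12 * i / 1200 for i in range(1201)]

# (i) upper bound on the 0-d-1 loss, (iv) boundedness by 1 + mu
assert all(loss_0_d_1(t, rho, d) <= double_ramp(t, rho, d, mu) + 1e-12 for t in grid)
assert all(double_ramp(t, rho, d, mu) <= 1 + mu + 1e-12 for t in grid)
# (iii) constant value d(1 + mu) on the flat part of the rejection region
assert abs(double_ramp(0.0, rho, d, mu) - d * (1 + mu)) < 1e-12
# (v) rho = 0 recovers the mu-ramp loss
assert all(abs(double_ramp(t, 0.0, d, mu) - ramp(t, mu)) < 1e-12 for t in grid)
```

Writing the loss as `d * ramp(t - rho) + (1 - d) * ramp(t + rho)` makes the "sum of two ramps" structure explicit: the first ramp is centered near $+\rho$ and the second near $-\rho$.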

Footnote 1: While $L_{dr}$ is parametrized by $\mu$ and $d$ as well, we omit
them for the sake of notational consistency.

2.2 Risk Formulation Using $L_{dr}$

Let $S = \{(x_n, y_n), n = 1 \ldots N\}$ be the training dataset, where
$x_n \in \mathbb{R}^p$, $y_n \in \{-1, +1\}$, $\forall n$. As discussed, we
minimize the regularized risk under $L_{dr}$ to find a reject option classifier.
In this paper, we use $l_2$ regularization. Let
$\Theta = [w^T \;\; b \;\; \rho]^T$. Thus, for $f(x) = (w^T \phi(x) + b)$, the
regularized risk under double ramp loss is

$$\begin{aligned}
R(\Theta) &= \frac{1}{2}\|w\|^2 + \frac{C}{\mu} \sum_{n=1}^{N} \Big\{ d\,[\mu - y_n f(x_n) + \rho]_+ - d\,[-\mu^2 - y_n f(x_n) + \rho]_+ \\
&\qquad + (1-d)\,[\mu - y_n f(x_n) - \rho]_+ - (1-d)\,[-\mu^2 - y_n f(x_n) - \rho]_+ \Big\} \\
&= \frac{1}{2}\|w\|^2 + \frac{C}{\mu} \sum_{n=1}^{N} \Big\{ d\,[\mu - y_n f(x_n) + \rho]_+ + (1-d)\,[\mu - y_n f(x_n) - \rho]_+ \\
&\qquad - d\,[-\mu^2 - y_n f(x_n) + \rho]_+ - (1-d)\,[-\mu^2 - y_n f(x_n) - \rho]_+ \Big\}
\end{aligned}$$

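where $C$ is the regularization parameter. While minimizing $R(\Theta)$, no
non-negativity condition on $\rho$ is required due to the following lemma.

Lemma 1. At the minimum of $R(\Theta)$, $\rho$ must be non-negative.

Proof. Let $\Theta' = (w', b', \rho')$ minimize $R(\Theta)$, where $\rho' < 0$.
Thus $-\rho' > 0$. Consider $\Theta'' = (w', b', -\rho')$ as another point.

$$\begin{aligned}
R(\Theta') - R(\Theta'') &= \frac{C(1-2d)}{\mu} \sum_{n=1}^{N} \Big\{ -[\mu - y_n f(x_n) + \rho']_+ + [-\mu^2 - y_n f(x_n) + \rho']_+ \\
&\qquad + [\mu - y_n f(x_n) - \rho']_+ - [-\mu^2 - y_n f(x_n) - \rho']_+ \Big\} \\
&= C(1-2d) \sum_{n=1}^{N} \Big[ L_{ramp}(y_n f(x_n) + \rho') - L_{ramp}(y_n f(x_n) - \rho') \Big]
\end{aligned}$$

where $L_{ramp}(t) = \frac{1}{\mu}\big([\mu - t]_+ - [-\mu^2 - t]_+\big)$ is a
monotonically non-increasing function of $t$ [11]. Since $\rho' < 0$, we have
$y_n f(x_n) + \rho' < y_n f(x_n) - \rho'$, $\forall n$. This implies
$L_{ramp}(y_n f(x_n) + \rho') \ge L_{ramp}(y_n f(x_n) - \rho')$, $\forall n$.
Also $(1 - 2d) > 0$, since $0 < d < 0.5$. Thus
$R(\Theta') - R(\Theta'') \ge 0$: replacing $\rho'$ by $-\rho'$ never increases
the risk, and it strictly decreases the risk whenever some term of the sum is
positive, which contradicts that $\Theta'$ minimizes $R(\Theta)$. Thus, a
minimum of $R(\Theta)$ can always be taken with $\rho$ non-negative.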
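The sign-flip argument in the proof of Lemma 1 can be illustrated numerically. The sketch below compares the data term of the risk at a negative $\rho'$ and at $-\rho'$ for randomly drawn margins. The margins and parameter values are hypothetical, and the regularizer is left out because it does not depend on $\rho$.

```python
import random


def ramp(t, mu):
    return (max(0.0, mu - t) - max(0.0, -mu ** 2 - t)) / mu


def data_term(margins, rho, d, mu):
    """Sum of double ramp losses over the margins y_n f(x_n)."""
    return sum(d * ramp(t - rho, mu) + (1 - d) * ramp(t + rho, mu) for t in margins)


random.seed(0)
margins = [random.uniform(-3, 3) for _ in range(200)]  # hypothetical y_n f(x_n)
d, mu = 0.2, 1.0
for rho_neg in (-0.1, -0.5, -2.0):
    gap = data_term(margins, rho_neg, d, mu) - data_term(margins, -rho_neg, d, mu)
    assert gap >= -1e-9  # flipping the sign of a negative rho never hurts
```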

3  Solution  Methodology 

$R(\Theta)$, the regularized risk defined in Section 2.2, is a nonconvex
function of $\Theta$. However, $R(\Theta)$ can be written as
$R(\Theta) = R_1(\Theta) - R_2(\Theta)$, where $R_1(\Theta)$ and $R_2(\Theta)$
are convex functions of $\Theta$.

$$R_1(\Theta) = \frac{1}{2}\|w\|^2 + \frac{C}{\mu} \sum_{n=1}^{N} \Big[ d\,[\mu - y_n f(x_n) + \rho]_+ + (1-d)\,[\mu - y_n f(x_n) - \rho]_+ \Big]$$

$$R_2(\Theta) = \frac{C}{\mu} \sum_{n=1}^{N} \Big[ d\,[-\mu^2 - y_n f(x_n) + \rho]_+ + (1-d)\,[-\mu^2 - y_n f(x_n) - \rho]_+ \Big]$$

In this case, DC programming is guaranteed to find a local optimum of
$R(\Theta)$ [1]. In the simplified DC algorithm [1], an upper bound of
$R(\Theta)$ is found using the convexity property of $R_2(\Theta)$ as follows.

$$R(\Theta) \le R_1(\Theta) - R_2(\Theta^{(l)}) - (\Theta - \Theta^{(l)})^T \nabla R_2(\Theta^{(l)}) =: ub(\Theta, \Theta^{(l)}) \tag{4}$$

where $\Theta^{(l)}$ is the parameter vector after the $l$-th iteration and
$\nabla R_2(\Theta^{(l)})$ is a subgradient of $R_2$ at $\Theta^{(l)}$.
$\Theta^{(l+1)}$ is found by minimizing $ub(\Theta, \Theta^{(l)})$. Thus,
$R(\Theta^{(l+1)}) \le ub(\Theta^{(l+1)}, \Theta^{(l)}) \le ub(\Theta^{(l)}, \Theta^{(l)}) = R(\Theta^{(l)})$,
which means that the DC program does not increase the value of $R(\Theta)$ in
any iteration.


3.1  Learning  Reject  Option  Classifier  Using  DC  Programming 

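In this section, we will derive a DC algorithm for minimizing $R(\Theta)$. We
initialize with $\Theta = \Theta^{(0)}$. Given $\Theta^{(l)}$, we find
$\Theta^{(l+1)}$ as

$$\Theta^{(l+1)} \in \arg\min_{\Theta} \; ub(\Theta, \Theta^{(l)}) = \arg\min_{\Theta} \; R_1(\Theta) - \Theta^T \nabla R_2(\Theta^{(l)}) \tag{5}$$

where $\nabla R_2(\Theta^{(l)})$ is the subgradient of $R_2(\Theta)$ at
$\Theta^{(l)}$. We choose $\nabla R_2(\Theta^{(l)})$ as:

$$\nabla R_2(\Theta^{(l)}) = \sum_{n=1}^{N} \beta'^{(l)}_n \big[-y_n \phi(x_n)^T \;\; -y_n \;\; 1\big]^T + \sum_{n=1}^{N} \beta''^{(l)}_n \big[-y_n \phi(x_n)^T \;\; -y_n \;\; -1\big]^T$$

where

$$\beta'^{(l)}_n = \frac{Cd}{\mu}\, I_{\{y_n(\phi(x_n)^T w^{(l)} + b^{(l)}) - \rho^{(l)} < -\mu^2\}}, \qquad
\beta''^{(l)}_n = \frac{C(1-d)}{\mu}\, I_{\{y_n(\phi(x_n)^T w^{(l)} + b^{(l)}) + \rho^{(l)} < -\mu^2\}} \tag{6}$$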
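The DC decomposition and the subgradient choice of Eq. (6) can be checked numerically for a linear $f(x) = w^T x + b$. The sketch below implements $R_1$, $R_2$, one subgradient of $R_2$ and the resulting bound $ub(\Theta, \Theta^{(l)})$, then asserts that the bound majorizes $R = R_1 - R_2$ at random points. The data, the parameter values and all function names are hypothetical; $\Theta$ is stored as a flat list `[w_1, ..., w_p, b, rho]`.

```python
import random


def pos(a):
    return max(0.0, a)


def margins(theta, data):
    *w, b, _ = theta
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in data]


def R1(theta, data, C, d, mu):
    *w, _, rho = theta
    s = sum(d * pos(mu - t + rho) + (1 - d) * pos(mu - t - rho)
            for t in margins(theta, data))
    return 0.5 * sum(wi * wi for wi in w) + C / mu * s


def R2(theta, data, C, d, mu):
    rho = theta[-1]
    return C / mu * sum(d * pos(-mu ** 2 - t + rho) + (1 - d) * pos(-mu ** 2 - t - rho)
                        for t in margins(theta, data))


def grad_R2(theta, data, C, d, mu):
    """One subgradient of R2 at theta (Eq. (6)), ordered as (w, b, rho)."""
    rho = theta[-1]
    g = [0.0] * len(theta)
    for (x, y), t in zip(data, margins(theta, data)):
        b1 = C * d / mu if t - rho < -mu ** 2 else 0.0        # beta'_n
        b2 = C * (1 - d) / mu if t + rho < -mu ** 2 else 0.0  # beta''_n
        for j, xj in enumerate(x):
            g[j] -= (b1 + b2) * y * xj
        g[-2] -= (b1 + b2) * y
        g[-1] += b1 - b2
    return g


def upper_bound(theta, theta_l, data, C, d, mu):
    """ub(theta, theta_l) from Eq. (4)."""
    g = grad_R2(theta_l, data, C, d, mu)
    lin = sum(gi * (a - c) for gi, a, c in zip(g, theta, theta_l))
    return R1(theta, data, C, d, mu) - R2(theta_l, data, C, d, mu) - lin


random.seed(1)
data = [([random.uniform(-1, 1), random.uniform(-1, 1)], random.choice([-1, 1]))
        for _ in range(30)]
C, d, mu = 10.0, 0.2, 1.0
theta_l = [0.5, -0.3, 0.1, 0.4]
for _ in range(100):
    theta = [random.uniform(-3, 3) for _ in range(4)]
    true_risk = R1(theta, data, C, d, mu) - R2(theta, data, C, d, mu)
    assert upper_bound(theta, theta_l, data, C, d, mu) >= true_risk - 1e-8
```

The bound is tight at $\Theta = \Theta^{(l)}$, which is what gives the chain of inequalities used to argue that the risk does not increase from one iteration to the next.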


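For $f(x) = (w^T \phi(x) + b)$, we rewrite the upper bound minimization problem
described in Eq. (5) as follows,

$$\begin{aligned}
P^{(l+1)} &= \min_{\Theta} \; R_1(\Theta) - \Theta^T \nabla R_2(\Theta^{(l)}) \\
&= \min_{w, b, \rho} \; \frac{1}{2}\|w\|^2 + \frac{C}{\mu} \sum_{n=1}^{N} \Big[ d\,[\mu - y_n f(x_n) + \rho]_+ + (1-d)\,[\mu - y_n f(x_n) - \rho]_+ \Big] \\
&\qquad + \sum_{n=1}^{N} \beta'^{(l)}_n \big[y_n f(x_n) - \rho\big] + \sum_{n=1}^{N} \beta''^{(l)}_n \big[y_n f(x_n) + \rho\big]
\end{aligned}$$

We rewrite $P^{(l+1)}$ as

$$\begin{aligned}
P^{(l+1)} = \min_{w, b, \xi', \xi'', \rho} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{\mu} \sum_{n=1}^{N} \big[ d\,\xi'_n + (1-d)\,\xi''_n \big] + \sum_{n=1}^{N} \beta'^{(l)}_n \big[y_n(w^T \phi(x_n) + b) - \rho\big] \\
& + \sum_{n=1}^{N} \beta''^{(l)}_n \big[y_n(w^T \phi(x_n) + b) + \rho\big] \\
\text{s.t.} \quad & y_n(w^T \phi(x_n) + b) \ge \rho + \mu - \xi'_n, \quad \xi'_n \ge 0, \quad n = 1 \ldots N \\
& y_n(w^T \phi(x_n) + b) \ge -\rho + \mu - \xi''_n, \quad \xi''_n \ge 0, \quad n = 1 \ldots N
\end{aligned}$$

where $\xi' = [\xi'_1 \; \xi'_2 \ldots \xi'_N]^T$ and
$\xi'' = [\xi''_1 \; \xi''_2 \ldots \xi''_N]^T$. The dual optimization problem
$D^{(l+1)}$ of $P^{(l+1)}$ is as follows.

$$\begin{aligned}
D^{(l+1)} = \min_{\gamma', \gamma''} \quad & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m (\gamma'_n + \gamma''_n)(\gamma'_m + \gamma''_m)\, k(x_n, x_m) - \mu \sum_{n=1}^{N} (\gamma'_n + \gamma''_n) \\
\text{s.t.} \quad & -\beta'^{(l)}_n \le \gamma'_n \le \frac{Cd}{\mu} - \beta'^{(l)}_n, \quad n = 1 \ldots N \\
& -\beta''^{(l)}_n \le \gamma''_n \le \frac{C(1-d)}{\mu} - \beta''^{(l)}_n, \quad n = 1 \ldots N \\
& \sum_{n=1}^{N} y_n (\gamma'_n + \gamma''_n) = 0, \qquad \sum_{n=1}^{N} (\gamma'_n - \gamma''_n) = 0
\end{aligned} \tag{7}$$

where $\gamma' = [\gamma'_1 \; \gamma'_2 \ldots \gamma'_N]^T$ and
$\gamma'' = [\gamma''_1 \; \gamma''_2 \ldots \gamma''_N]^T$ are dual variables,
and $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$ is the kernel function.

At the optimality of $P^{(l+1)}$, $w$ can be found as
$w = \sum_{n=1}^{N} y_n (\gamma'_n + \gamma''_n)\, \phi(x_n)$. Since $P^{(l+1)}$
has a quadratic objective and linear constraints, strong duality holds between
$P^{(l+1)}$ and $D^{(l+1)}$. Solving $D^{(l+1)}$ is more useful as it can be
easily kernelized for non-linear problems. The behavior of $\gamma'_n$ and
$\gamma''_n$ under different cases is as follows.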


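Condition                                    $\gamma'_n$                                                   $\gamma''_n$
$y_n(w^T\phi(x_n) + b) - \mu > \rho$         $= -\beta'^{(l)}_n$                                           $= -\beta''^{(l)}_n$
$y_n(w^T\phi(x_n) + b) - \mu = \rho$         $\in (-\beta'^{(l)}_n, \frac{Cd}{\mu} - \beta'^{(l)}_n)$      $= -\beta''^{(l)}_n$
$y_n(w^T\phi(x_n) + b) - \mu \in (-\rho, \rho)$   $= \frac{Cd}{\mu} - \beta'^{(l)}_n$                      $= -\beta''^{(l)}_n$
$y_n(w^T\phi(x_n) + b) - \mu = -\rho$        $= \frac{Cd}{\mu} - \beta'^{(l)}_n$                           $\in (-\beta''^{(l)}_n, \frac{C(1-d)}{\mu} - \beta''^{(l)}_n)$
$y_n(w^T\phi(x_n) + b) - \mu < -\rho$        $= \frac{Cd}{\mu} - \beta'^{(l)}_n$                           $= \frac{C(1-d)}{\mu} - \beta''^{(l)}_n$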
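Because $w$ is a weighted sum of the mapped training points, the decision value only needs kernel evaluations. The following sketch shows how a kernelized reject option prediction could be computed from given dual variables. The RBF kernel form, its width parameter `kernel_width`, and the toy values of the dual variables are assumptions made for illustration; in the method itself the dual variables would come from solving the dual problem.

```python
import math


def rbf_kernel(x, z, kernel_width):
    """k(x, z) = exp(-kernel_width * ||x - z||^2) (assumed RBF form)."""
    return math.exp(-kernel_width * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, train_x, train_y, g1, g2, b, kernel_width):
    """f(x) = sum_n y_n (g1_n + g2_n) k(x_n, x) + b, with g1 = gamma', g2 = gamma''."""
    return b + sum(y * (a1 + a2) * rbf_kernel(xn, x, kernel_width)
                   for xn, y, a1, a2 in zip(train_x, train_y, g1, g2))


def predict(x, train_x, train_y, g1, g2, b, rho, kernel_width):
    f = decision_value(x, train_x, train_y, g1, g2, b, kernel_width)
    return 1 if f > rho else (-1 if f < -rho else 0)


# Hypothetical two-point training set with hand-picked dual variables that
# satisfy both equality constraints of the dual problem.
train_x = [[0.0, 0.0], [2.0, 0.0]]
train_y = [1, -1]
g1, g2 = [0.5, 0.5], [0.5, 0.5]
for query in ([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]):
    print(query, predict(query, train_x, train_y, g1, g2, b=0.0, rho=0.5, kernel_width=1.0))
```

The midpoint between the two training points gets a decision value of zero and is therefore rejected, while the training points themselves are classified with their own labels.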


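3.2 Finding $b^{(l+1)}$ and $\rho^{(l+1)}$

To find $b^{(l+1)}$ and $\rho^{(l+1)}$, we consider
$x_n \in SV'^{(l+1)} \cup SV''^{(l+1)}$, where

$$SV'^{(l+1)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l+1)} + b^{(l+1)}) = \rho^{(l+1)} + \mu\}$$
$$SV''^{(l+1)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l+1)} + b^{(l+1)}) = -\rho^{(l+1)} + \mu\}$$

We already saw that

1. If $x_n \in SV'^{(l+1)}$, then $\gamma'^{(l+1)}_n \in \big(-\beta'^{(l)}_n, \frac{Cd}{\mu} - \beta'^{(l)}_n\big)$ and $\gamma''^{(l+1)}_n = -\beta''^{(l)}_n$

2. If $x_n \in SV''^{(l+1)}$, then $\gamma'^{(l+1)}_n = \frac{Cd}{\mu} - \beta'^{(l)}_n$ and $\gamma''^{(l+1)}_n \in \big(-\beta''^{(l)}_n, \frac{C(1-d)}{\mu} - \beta''^{(l)}_n\big)$

We solve the system of linear equations corresponding to sets $SV'^{(l+1)}$ and
$SV''^{(l+1)}$ for identifying $b^{(l+1)}$ and $\rho^{(l+1)}$.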
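Each support vector contributes one linear equation in the two unknowns $b$ and $\rho$. The text does not say how the resulting (usually overdetermined) system is solved, so the sketch below makes an assumption: it uses a least-squares fit through the 2-by-2 normal equations. The inputs are pairs $(s_n, y_n)$ with $s_n = w^T \phi(x_n)$ assumed to be precomputed, and `solve_b_rho` is a hypothetical helper name.

```python
def solve_b_rho(sv1, sv2, mu):
    """Least-squares solution of
         y_n (s_n + b) - rho = mu   for (s_n, y_n) in sv1  (set SV')
         y_n (s_n + b) + rho = mu   for (s_n, y_n) in sv2  (set SV'')
    Each row is (coefficient of b, coefficient of rho, right-hand side)."""
    rows = [(y, -1.0, mu - y * s) for s, y in sv1]
    rows += [(y, 1.0, mu - y * s) for s, y in sv2]
    a11 = sum(a * a for a, _, _ in rows)
    a12 = sum(a * c for a, c, _ in rows)
    a22 = sum(c * c for _, c, _ in rows)
    r1 = sum(a * r for a, _, r in rows)
    r2 = sum(c * r for _, c, r in rows)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("support vectors do not determine b and rho uniquely")
    # Cramer's rule on the 2x2 normal equations.
    return (r1 * a22 - r2 * a12) / det, (a11 * r2 - a12 * r1) / det


# Hypothetical support vectors generated from b = 0.3, rho = 0.8, mu = 1.
b, rho = solve_b_rho(sv1=[(1.5, 1), (-2.1, -1)], sv2=[(-0.1, 1)], mu=1.0)
print(round(b, 6), round(rho, 6))
```

With a single support vector from each set and opposite labels, the two equations are linearly dependent, which is why the sketch raises an error instead of returning an arbitrary solution.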


3.3  Summary  of  the  Algorithm 

We fix $d \in [0, .5]$, $\mu \in (0, 1]$ and $C$, and initialize the parameter
vector $\Theta$ as $\Theta^{(0)}$. In any iteration $(l)$, we find
$\beta'^{(l)}_n, \beta''^{(l)}_n$, $n = 1 \ldots N$ (see Eq. (6)). We solve
$D^{(l+1)}$ to find $\gamma'^{(l+1)}, \gamma''^{(l+1)}$. $w^{(l+1)}$ is found as
$w^{(l+1)} = \sum_{n=1}^{N} y_n (\gamma'^{(l+1)}_n + \gamma''^{(l+1)}_n)\, \phi(x_n)$.
We find $b^{(l+1)}$ and $\rho^{(l+1)}$ as described in Section 3.2. Thus, we have
found $\Theta^{(l+1)}$. Using $\Theta^{(l+1)}$, we now find
$\beta'^{(l+1)}_n, \beta''^{(l+1)}_n$, $n = 1 \ldots N$. We repeat the above two
steps until the parameter vector $\Theta$ no longer changes significantly. A more
formal description of our algorithm is provided in Algorithm 1.


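Algorithm 1. Learning Reject Option Classifier by Minimizing $R(\Theta)$

Input : $d \in [0, .5]$, $\mu \in (0, 1]$, $C > 0$, $S$
Output : $w^*, b^*, \rho^*$

Initialize $w^{(0)}, b^{(0)}, \rho^{(0)}$, $l = 0$

repeat

    Compute $\beta'^{(l)}_n = \frac{Cd}{\mu}\, I_{\{y_n(\phi(x_n)^T w^{(l)} + b^{(l)}) - \rho^{(l)} < -\mu^2\}}$,
            $\beta''^{(l)}_n = \frac{C(1-d)}{\mu}\, I_{\{y_n(\phi(x_n)^T w^{(l)} + b^{(l)}) + \rho^{(l)} < -\mu^2\}}$

    Find $\gamma'^{(l+1)}, \gamma''^{(l+1)}$ by solving $D^{(l+1)}$ described in Eq. (7)

    Find $w^{(l+1)} = \sum_{n=1}^{N} y_n (\gamma'^{(l+1)}_n + \gamma''^{(l+1)}_n)\, \phi(x_n)$

    Find $b^{(l+1)}$ and $\rho^{(l+1)}$ by solving the system of linear equations
    corresponding to sets $SV'^{(l+1)}$ and $SV''^{(l+1)}$, where

        $SV'^{(l+1)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l+1)} + b^{(l+1)}) = \rho^{(l+1)} + \mu\}$

        $SV''^{(l+1)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l+1)} + b^{(l+1)}) = -\rho^{(l+1)} + \mu\}$

    $l \leftarrow l + 1$

until convergence of $\Theta^{(l)}$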
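The overall loop can be sketched in a few lines for a linear classifier. Note the substitutions: instead of solving the dual quadratic program, this sketch minimizes the convex upper bound approximately with plain subgradient descent on $(w, b, \rho)$, and it adds a safeguard that stops as soon as the true risk no longer decreases. The step size, iteration counts, toy data and all function names are hypothetical choices, so this is an illustration of the DC idea rather than a reimplementation of Algorithm 1.

```python
import random


def pos(a):
    return max(0.0, a)


def risk(theta, data, C, d, mu):
    """Regularized double ramp risk R(theta) for a linear f(x) = w.x + b."""
    *w, b, rho = theta
    total = 0.5 * sum(wi * wi for wi in w)
    for x, y in data:
        t = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += C / mu * (d * (pos(mu - t + rho) - pos(-mu ** 2 - t + rho))
                           + (1 - d) * (pos(mu - t - rho) - pos(-mu ** 2 - t - rho)))
    return total


def dc_step(theta, data, C, d, mu, iters=300, lr=1e-3):
    """Approximately minimize ub(., theta) by subgradient descent (swapped-in solver)."""
    *w0, b0, rho0 = theta
    betas = []
    for x, y in data:  # Eq. (6), evaluated at the current iterate
        t = y * (sum(wi * xi for wi, xi in zip(w0, x)) + b0)
        betas.append((C * d / mu if t - rho0 < -mu ** 2 else 0.0,
                      C * (1 - d) / mu if t + rho0 < -mu ** 2 else 0.0))
    cur = list(theta)
    for _ in range(iters):
        *w, b, rho = cur
        g = list(w) + [0.0, 0.0]  # gradient of the regularizer; b and rho start at 0
        for (x, y), (b1, b2) in zip(data, betas):
            t = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            a1 = C * d / mu if mu - t + rho > 0 else 0.0
            a2 = C * (1 - d) / mu if mu - t - rho > 0 else 0.0
            coef = b1 + b2 - a1 - a2  # derivative of the bound w.r.t. the margin t
            for j, xj in enumerate(x):
                g[j] += coef * y * xj
            g[-2] += coef * y
            g[-1] += (a1 - b1) - (a2 - b2)
        cur = [c - lr * gi for c, gi in zip(cur, g)]
    return cur


def fit(data, C=1.0, d=0.2, mu=1.0, outer=20):
    theta = [0.0] * (len(data[0][0]) + 2)
    best = risk(theta, data, C, d, mu)
    for _ in range(outer):
        cand = dc_step(theta, data, C, d, mu)
        r = risk(cand, data, C, d, mu)
        if r >= best - 1e-9:  # safeguard: the inner solver is only approximate
            break
        theta, best = cand, r
    return theta, best


random.seed(0)
# Hypothetical toy data: two noisy 2-D blobs, one per class.
data = [([random.gauss(y, 0.7), random.gauss(y, 0.7)], y) for y in [1, -1] * 30]
theta, final_risk = fit(data)
print(final_risk <= risk([0.0] * 4, data, 1.0, 0.2, 1.0))  # True by construction
```

An exact inner solver would make the safeguard unnecessary, since the majorization argument already guarantees a non-increasing risk. It is kept here only because subgradient descent returns an approximate minimizer.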


3.4 $\gamma'$ and $\gamma''$ at the Convergence of Algorithm 1

At the convergence of Algorithm 1, let $\gamma'^*_n, \gamma''^*_n$,
$n = 1 \ldots N$ be the values of the dual variables. The behavior of
$\gamma'^*_n$ and $\gamma''^*_n$ is described in Table 2. For any $x_n$, only one
of $\gamma'^*_n$ and $\gamma''^*_n$ can be nonzero. We observe that parameters
$w$, $b$ and $\rho$ are determined by the points whose margin ($y f(x)$) is in
the range $[\rho - \mu^2, \rho + \mu] \cup [-\rho - \mu^2, -\rho + \mu]$. We call
these points support vectors. We also see that for $x_n$ with
$y_n f(x_n) \in (\rho + \mu, \infty) \cup (-\rho + \mu, \rho - \mu^2) \cup (-\infty, -\rho - \mu^2)$,
both $\gamma'^*_n$ and $\gamma''^*_n$ are 0. Thus, points which are correctly
classified with margin greater than $\rho + \mu$, points falling close to the
decision boundary with margin in the interval $(-\rho + \mu, \rho - \mu^2)$, and
points misclassified with a high negative margin (less than $-\rho - \mu^2$) are
ignored in the final classifier. Thus, our approach not only rejects points
falling in the overlapping region of classes, it also ignores potential
outliers. We illustrate these insights through experiments on a synthetic
dataset as shown in Fig. 3. 400 points are uniformly sampled from the square
region $[0, 1] \times [0, 1]$. We consider the diagonal passing through the
origin as the separating surface and assign labels $\{-1, +1\}$ to all the
points using it. We changed the labels of 80 points inside the band
(width = 0.225) around the separating surface.

Fig. 3 shows the reject option classifier learnt using the proposed method. We
see that the proposed approach learns the rejection region accurately. We also
observe that all of the support vectors are near the two parallel hyperplanes.
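A dataset of the kind described above could be generated as in the following sketch. Two details are not fixed by the description and are assumptions here: the band "width" is taken as the total width measured perpendicular to the diagonal $x_1 = x_2$, and the side $x_1 > x_2$ is given the label $+1$. The function name and the seed are hypothetical.

```python
import random


def make_noisy_square(n=400, n_flip=80, band=0.225, seed=0):
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    # Assumption: points below the diagonal x1 = x2 get label +1.
    labels = [1 if x1 > x2 else -1 for x1, x2 in points]
    # Assumption: `band` is the total width, measured perpendicular to the diagonal.
    in_band = [i for i, (x1, x2) in enumerate(points)
               if abs(x1 - x2) / 2 ** 0.5 <= band / 2]
    flipped = rng.sample(in_band, min(n_flip, len(in_band)))
    for i in flipped:
        labels[i] = -labels[i]
    return points, labels, flipped


points, labels, flipped = make_noisy_square()
print(len(points), len(flipped))
```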

Table 2. Behavior of $\gamma'^*_n$ and $\gamma''^*_n$

Condition                                                   $\gamma'^*_n \in$          $\gamma''^*_n \in$
$y_n(w^T\phi(x_n) + b) \in (\rho + \mu, \infty)$            $0$                        $0$
$y_n(w^T\phi(x_n) + b) = \rho + \mu$                        $(0, \frac{Cd}{\mu})$      $0$
$y_n(w^T\phi(x_n) + b) \in [\rho - \mu^2, \rho + \mu)$      $\frac{Cd}{\mu}$           $0$
$y_n(w^T\phi(x_n) + b) \in (-\rho + \mu, \rho - \mu^2)$     $0$                        $0$
$y_n(w^T\phi(x_n) + b) = -\rho + \mu$                       $0$                        $(0, \frac{C(1-d)}{\mu})$
$y_n(w^T\phi(x_n) + b) \in [-\rho - \mu^2, -\rho + \mu)$    $0$                        $\frac{C(1-d)}{\mu}$
$y_n(w^T\phi(x_n) + b) \in (-\infty, -\rho - \mu^2)$        $0$                        $0$

Fig. 3. Left figure (panel title: Data with Label Noise) shows that label noise
affects points near the true classification boundary. Right figure shows the
reject option classifier learnt using the $L_{dr}$ based approach ($C = 100$,
$\mu = 1$, $d = .2$). Filled circles and triangles represent the support vectors.

4  Experimental  Results 

We  show  the  effectiveness  of  our  approach  by  showing  its  performance  on  several 
datasets.  We  also  compare  our  approach  with  the  approach  proposed  in  [7]. 

4.1  Dataset  Description 

We report experimental results on one synthetic dataset and two datasets taken
from the UCI ML repository [2].

1. Synthetic Dataset : Let $f_1$ and $f_2$ be two mixture density functions in
$\mathbb{R}^2$ defined as follows:

$$f_1(x) = 0.45\,\mathcal{U}([1,0] \times [1,1]) + 0.5\,\mathcal{U}([4,3] \times [0,1]) + 0.05\,\mathcal{U}([10,0] \times [5,5])$$
$$f_2(x) = 0.45\,\mathcal{U}([0,1] \times [1,1]) + 0.5\,\mathcal{U}([9,10] \times [1,0]) + 0.05\,\mathcal{U}([0,10] \times [5,5])$$

where $\mathcal{U}(A)$ denotes the uniform density function with support set
$A$. We sample 150 points independently each from $f_1$ and $f_2$. We label
these points using the hyperplane with $w = [1 \;\; 0]^T$ and $b = 0$. We choose
10% of these points uniformly at random and flip their labels.

2. Ionosphere Dataset [2] : This dataset describes the problem of
discriminating good versus bad radars based on whether they send some useful
information about the Ionosphere. There are 34 variables and 351 observations.

3. Parkinsons Disease Dataset [2] : This dataset is used to discriminate
people with Parkinsons disease from healthy people. There are 195 feature
vectors, each having 22 features.

4.2  Experimental  Setup 

In the proposed $L_{dr}$ based approach, for solving the dual $D^{(l)}$ at every
iteration, we have used the kernlab package [9] in R. We thank the authors of
the $L_{dh}$ based method [7] for providing the code for their approach. For
nonlinear problems, we use the RBF kernel. In our approach, we set $\mu = 1$.
$C$ and $\gamma$ (width parameter for the RBF kernel) are chosen using 10-fold
cross validation.

4.3  Simulation  Results 

We report results for values of $d$ in the interval $[0.05, 0.5]$ with a step
size of 0.05. For every value of $d$, we find the cross validation risk (under
$L_{0-d-1}$), % accuracy on the non-rejected examples (Acc) and % rejection rate
(RR). The results provided are based on 10 repetitions of 10-fold cross
validation (CV). We show the average values and standard deviation (computed
over the 10 repetitions).
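The three reported quantities follow directly from the predictions. As a minimal sketch, the function below computes the empirical risk under $L_{0-d-1}$, the rejection rate and the accuracy on non-rejected examples, using 0 to denote a rejection. The function name and the example predictions are hypothetical.

```python
def evaluate(preds, labels, d):
    """Return (risk under L_{0-d-1}, % rejection rate, % accuracy on non-rejected)."""
    n = len(labels)
    rejected = sum(1 for p in preds if p == 0)
    errors = sum(1 for p, y in zip(preds, labels) if p != 0 and p != y)
    risk = (errors + d * rejected) / n
    rr = 100.0 * rejected / n
    kept = n - rejected
    # Accuracy is undefined when every example is rejected.
    acc = 100.0 * (kept - errors) / kept if kept else None
    return risk, rr, acc


# Hypothetical predictions: 2 rejections and 2 errors out of 10 examples.
preds = [1, 0, -1, 1, 0, -1, 1, 1, -1, 1]
labels = [1, 1, -1, -1, -1, -1, 1, 1, 1, 1]
print(evaluate(preds, labels, d=0.2))
```

Returning `None` for the accuracy when everything is rejected corresponds to an "NA" entry in a results table.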

We now discuss the experimental results. Fig. 4(a) shows the Synthetic dataset
and the true classification boundary. Fig. 4(b) and (c) show the classifiers
learnt using the $L_{dr}$ and $L_{dh}$ based approaches respectively for
$d = 0.2$. The $L_{dr}$ based approach accurately finds the true classification
boundary, as opposed to the $L_{dh}$ based approach. Also, the reject region
found by the $L_{dr}$ based approach is the most ambiguous region, unlike the
$L_{dh}$ based approach which rejects almost all the points.

Fig. 4. (a) Synthetic Dataset and the true classification boundary. Reject
option classifiers learnt using (b) the proposed $L_{dr}$ based approach for
$d = 0.2$, (c) the $L_{dh}$ based approach for $d = 0.2$.

Table 3. Comparison results on Synthetic dataset (linear classifiers for both
the approaches). RR and Acc(unrej) are in %.

      $L_{dr}$ (C = 2)                         $L_{dh}$ (C = 32)
d     Risk         RR           Acc(unrej)   Risk         RR           Acc(unrej)
0.05  0.068±0.015  90.87±5.79   75.87±7.95   0.05         100          NA
0.1   0.138±0.023  70.35±12.18  79.05±6.87   0.105±0.002  95.53±1.69   77.20±6.06
0.15  0.135±0.003  65.41±5.06   89.66±0.90   0.136        72.77±0.23   90.56±0.66
0.2   0.155±0.006  43.18±4.31   88.56±0.75   0.17         72.67        90.36±1.44
0.25  0.164±0.014  32.13±8.43   87.97±1.42   0.204±0.003  66.5±1.7     91±0.74
0.3   0.148±0.012  13.23±7.52   87.67±0.69   0.197        46.73±0.14   89.37±0.32
0.35  0.134±0.005  4.57±1.80    87.68±0.23   0.21±0.002   43.33±0.65   90.02±0.38
0.4   0.131±0.003  1.51±0.56    87.29±0.30   0.21±0.006   31.17±1.26   87.41±0.55
0.45  0.128±0.002  0.86±0.45    87.45±0.25   0.265±0.008  9.13±1.1     75.58±0.98
0.5   0.136±0.01   0            86.41±0.99   0.297±0.004  0            70.27±0.44

Table 4. Comparison results on Ionosphere dataset (nonlinear classifiers using
RBF kernel for both the approaches). RR and Acc(unrej) are in %.

      $L_{dr}$ (C = 2, $\gamma$ = 0.125)       $L_{dh}$ (C = 16, $\gamma$ = 0.125)
d     Risk         RR           Acc(unrej)   Risk         RR           Acc(unrej)
0.05  0.025±0.002  34.84±0.92   98.94±0.31   0.029        52.61±0.73   99.47±0.06
0.1   0.027±0.003  8.81±0.32    97.99±0.33   0.047±0.002  43.44±0.85   99.46±0.17
0.15  0.039±0.003  5.78±0.57    96.81±0.29   0.042±0.003  24.02±1.62   99.3±0.37
0.2   0.044±0.001  3.46±0.51    96.18±0.15   0.04±0.002   17.43±0.59   99.42±0.25
0.25  0.047±0.002  1.76±0.41    95.68±0.23   0.046±0.001  14.47±0.79   98.9±0.16
0.3   0.052±0.003  0.92±0.46    95.08±0.35   0.051±0.003  12.57±0.75   98.56±0.31
0.35  0.051±0.003  0.03±0.09    94.88±0.29   0.054±0.002  9.33±0.59    97.72±0.21
0.4   0.051±0.002  0            94.95±0.24   0.054±0.003  6.72±0.86    97.09±0.35
0.45  0.054±0.002  0            94.64±0.21   0.055±0.003  3.53±0.41    95.97±0.36
0.5   0.054±0.001  0            94.62±0.13   0.055±0.005  0            94.55±0.47

Tables 3-5 show the experimental results on all the datasets. We observe the
following:

1. The proposed $L_{dr}$ based method outperforms the $L_{dh}$ based approach
in terms of the risk (expectation of $L_{0-d-1}$) for most values of $d$. For
the Synthetic dataset, except for $d = 0.05$ and $0.1$, the $L_{dr}$ based
method has lower CV risk. Similarly, for the Ionosphere dataset, except for
$d = 0.2$, $0.25$ and $0.3$, the $L_{dr}$ based method has lower CV risk. For
the Parkinsons dataset, the $L_{dr}$ based method has lower CV risk except for
$d = 0.35$.

2. The $L_{dr}$ based method also outputs classifiers with a lower rejection
rate in almost all settings, often by a wide margin. The exceptions are
$d = 0.15$ and $0.2$ on the Parkinsons dataset, where the $L_{dh}$ based method
rejects fewer points, and $d = 0.5$, where neither method rejects any point.

Thus, in most settings the proposed $L_{dr}$ based approach outputs classifiers
with lesser risk and lesser rejection rate compared to the $L_{dh}$ based
approach.

Table 5. Comparison results on Parkinsons Disease dataset (linear classifiers
for both the approaches). RR and Acc(unrej) are in %.

      $L_{dr}$ (C = 32)                        $L_{dh}$ (C = 32)
d     Risk         RR           Acc(unrej)   Risk         RR           Acc(unrej)
0.05  0.031±0.002  43.88±0.80   98.33±0.49   0.043±0.001  86.38±0.92   100
0.1   0.051±0.004  41.79±0.77   98.07±1.03   0.061±0.002  53.76±1.64   98.61±0.62
0.15  0.071±0.002  40.08±1.21   98.14±0.48   0.086±0.004  39.56±1.13   95.8±0.72
0.2   0.095±0.004  37.67±1.04   96.99±0.55   0.125±0.008  29.78±2.06   90.86±1.5
0.25  0.133±0.009  20.46±2.79   90.26±1.30   0.142±0.004  22.3±1.95    89.02±0.73
0.3   0.129±0.01   4.06±2.06    87.83±1.15   0.131±0.009  14.19±1.05   89.76±1.01
0.35  0.134±0.007  2.49±1.04    87.19±0.76   0.133±0.004  9.97±1.18    89.10±0.57
0.4   0.131±0.008  0.56±0.44    87.06±0.75   0.133±0.006  6.10±1.62    88.53±0.92
0.45  0.133±0.013  0.05±0.17    86.72±1.28   0.14±0.009   2.92±1.09    86.96±1.05
0.5   0.133±0.009  0            86.65±0.94   0.139±0.008  0            86.06±0.76

5  Conclusion  and  Future  Work 

In this paper, we have proposed a new loss $L_{dr}$ (double ramp) for learning
the reject option classifier. $L_{dr}$ gives a tighter upper bound for
$L_{0-d-1}$ compared to the convex losses $L_{dh}$ and $L_{gh}$. Our approach
learns the classifier by minimizing the regularized risk under the double ramp
loss, which becomes an instance of a DC optimization problem. Our approach can
also learn nonlinear classifiers by using an appropriate kernel function.
Experimentally, we have shown that our approach performs better than the
$L_{dh}$ based approach for learning reject option classifiers.